{{% toc %}}

Introduction

For this first blog post I thought I’d take a recent #TidyTuesday dataset and do some analysis of it. Where better to start than one of my favourite things? Cocktails! You can find the dataset and information about it here.

Firstly I’d like to acknowledge Julia Silge and David Robinson, whose approaches to analysing this data I enjoyed. In this post I’ve taken some of their ideas and then developed them for my own purposes.

I will first clean the data, do some exploratory analysis, then apply some unsupervised learning techniques to see what patterns we can find.

Let’s load the cocktails receipes dataset and take a quick look at it:

boston_cocktails <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv")

boston_cocktails %>% head()
## # A tibble: 6 x 6
##   name         category        row_id ingredient_number ingredient       measure
##   <chr>        <chr>            <dbl>             <dbl> <chr>            <chr>  
## 1 Gauguin      Cocktail Class…      1                 1 Light Rum        2 oz   
## 2 Gauguin      Cocktail Class…      1                 2 Passion Fruit S… 1 oz   
## 3 Gauguin      Cocktail Class…      1                 3 Lemon Juice      1 oz   
## 4 Gauguin      Cocktail Class…      1                 4 Lime Juice       1 oz   
## 5 Fort Lauder… Cocktail Class…      2                 1 Light Rum        1 1/2 …
## 6 Fort Lauder… Cocktail Class…      2                 2 Sweet Vermouth   1/2 oz

So basically on each line we have the name of the cocktail, its category, then each ingredient in it and the amount of it required. Let’s see what the top ingredients are:

boston_cocktails %>%
  count(ingredient, sort = TRUE) %>% 
  slice(1:20)
## # A tibble: 20 x 2
##    ingredient            n
##    <chr>             <int>
##  1 Gin                 176
##  2 Fresh lemon juice   138
##  3 Simple Syrup        115
##  4 Vodka               114
##  5 Light Rum           113
##  6 Dry Vermouth        107
##  7 Fresh Lime Juice    107
##  8 Triple Sec          107
##  9 Powdered Sugar       90
## 10 Grenadine            85
## 11 Sweet Vermouth       83
## 12 Brandy               80
## 13 Lemon Juice          70
## 14 Blanco tequila       69
## 15 Egg White            67
## 16 Angostura Bitters    63
## 17 Juice of a Lemon     60
## 18 Pineapple Juice      47
## 19 Bourbon whiskey      38
## 20 Orange Bitters       38

Look at entries 2, 13 and 17. It’s all just lemon juice with different names! We need to do some data cleaning. In the next section we modify the ingredient column to deal with various issues described in the in-line comments.

# Credit to Julia Silge for much of this code
cocktails_parsed <- boston_cocktails %>%
  mutate(
    ingredient = str_to_lower(ingredient),  # make everything lower case
    ingredient = str_replace_all(ingredient, "-", " "), # Remove dashes
    # Remove some stop words
    ingredient = str_remove(ingredient, " liqueur$"),
    ingredient = str_remove(ingredient, " (if desired)$"),
    # Consolidate all types of bitters into 'bitters', lemons to 'lemon juice', whiskies to 'whiskey', etc
    ingredient = case_when(
      str_detect(ingredient, "bitters") ~ "bitters",
      str_detect(ingredient, "lemon") ~ "lemon juice",
      str_detect(ingredient, "lime") ~ "lime juice",
      str_detect(ingredient, "grapefruit") ~ "grapefruit juice",
      str_detect(ingredient, "orange") ~ "orange juice",
      str_detect(ingredient, "whiskey") ~ "whiskey",
      str_detect(ingredient, "rum") ~ "rum",
      TRUE ~ ingredient
    ),
    # Looks like units of bitters are erroneously given as ounces - change this to 'dash'
    measure = case_when(
      str_detect(ingredient, "bitters") ~ str_replace(measure, "oz$", "dash"),
      TRUE ~ measure
    ),
    # Change fractions to decimals in the measure column
    measure = str_replace(measure, " ?1/2", ".5"),
    measure = str_replace(measure, " ?3/4", ".75"),
    measure = str_replace(measure, " ?1/4", ".25"),
    # parse the number out of the measure
    measure_number = parse_number(measure),
    measure_number = if_else(str_detect(measure, "dash$"),
      measure_number / 50,
      measure_number
    )
  ) 
cocktails_parsed
## # A tibble: 3,643 x 7
##    name    category   row_id ingredient_numb… ingredient  measure measure_number
##    <chr>   <chr>       <dbl>            <dbl> <chr>       <chr>            <dbl>
##  1 Gauguin Cocktail …      1                1 rum         2 oz              2   
##  2 Gauguin Cocktail …      1                2 passion fr… 1 oz              1   
##  3 Gauguin Cocktail …      1                3 lemon juice 1 oz              1   
##  4 Gauguin Cocktail …      1                4 lime juice  1 oz              1   
##  5 Fort L… Cocktail …      2                1 rum         1.5 oz            1.5 
##  6 Fort L… Cocktail …      2                2 sweet verm… .5 oz             0.5 
##  7 Fort L… Cocktail …      2                3 orange jui… .25 oz            0.25
##  8 Fort L… Cocktail …      2                4 lime juice  .25 oz            0.25
##  9 Apple … Cordials …      3                1 apple schn… 3 oz              3   
## 10 Apple … Cordials …      3                2 cinnamon s… 1 oz              1   
## # … with 3,633 more rows

Let’s have another look at the most frequent ingredients:

cocktails_parsed %>%
  count(ingredient, sort = TRUE)
## # A tibble: 405 x 2
##    ingredient       n
##    <chr>        <int>
##  1 lemon juice    311
##  2 rum            184
##  3 lime juice     180
##  4 gin            176
##  5 bitters        166
##  6 whiskey        138
##  7 orange juice   124
##  8 simple syrup   115
##  9 vodka          114
## 10 dry vermouth   107
## # … with 395 more rows
num_ingreds <- cocktails_parsed %>% distinct(ingredient) %>% count()

We still have 405 of them, that’s too many! Let’s filter for ingredients that appear at least 15 times:

cocktails_parsed_filtered <- cocktails_parsed %>% 
  add_count(ingredient) %>% # get a count of ingredients so we can filter out rare ones
  filter(n >= 15) %>%  # ingredient must appear at least 15 times
  select(-n) %>%
  distinct(row_id, ingredient, .keep_all = TRUE) %>%    # to stop repetitions e.g. if we had lemon slice and lemon juice in a recipe then we will end up with 2 'lemon's as per our mutate() calls earlier
  na.omit()

num_ingreds <- cocktails_parsed_filtered %>%
  distinct(ingredient) %>% 
  summarise(n = n()) %>% 
  as.integer()

num_ingreds
## [1] 39

So we now have 39 different ingredients in our dataset. Time for some exploratory analysis.

Exploratory Data Analysis

Let’s take another look at the top ingredients, this time in graph form:

cocktails_parsed_filtered %>% 
        count(ingredient, sort = TRUE) %>% 
        head(20) %>% 
        mutate(ingredient = fct_reorder(ingredient, n)) %>% 
        ggplot(aes(n, ingredient)) + 
        geom_col() +
        labs(title = "Most common ingredients in cocktail recipes")

This looks much better. Unsurprisingly lemon juice is at the top, while the top alcoholic ingredient is gin closely followed by rum. Bitters, whiskey and vodka are also very popular, which makes sense if you frequent bars quite often –like I do–.

Let’s see how these numbers are broken down by another of the variables, category of cocktail:

n_recipes <- n_distinct(cocktails_parsed_filtered$name)  # get the total no.recipes

cocktails_parsed_filtered %>% 
        mutate(category = fct_lump_n(category, 8),  # Get the 6 most frequent categories and lump the rest into 'Other'
               ingredient = fct_lump_n(ingredient, 20)) %>% 
        filter(ingredient != "Other") %>% 
        count(category, ingredient, sort = TRUE) %>%  # D. Robinson included this line a few lines too early, e.g. vodka was erroneously being included in 'Other' 
        mutate(ingredient = fct_reorder(ingredient, n, sum)) %>% 
        ggplot(aes(n/n_recipes, ingredient, fill = category)) +   # turn x-axis into %
        geom_col() +
        scale_x_continuous(labels = percent_format()) +
        scale_fill_brewer(palette = "Set1") + # change default colour scheme
        labs(title = "Most common ingredients in cocktail recipes",
             x = "% of all recipes",
             y = "Ingredient",
             fill = "Category")

I think this break-down by category isn’t particularly insightful or reliable - for example, most drinks containing gin aren’t in the ‘Gin’ category. The rather vague ‘Cocktail Classics’ category seems to include plenty of drinks which could be placed in other categories. Perhaps a better way to organise this dataset would have been to allow multiple categories per drink?…

Co-occurrence of ingredients

Let’s look at another question now: Which ingredients tend to appear together in the same recipe?

First we calculate correlation between ingredients:

library(widyr) 
library(tidytext)

ingredient_pairs <- cocktails_parsed_filtered %>% 
        add_count(ingredient) %>%   # like group_by then count
        filter(n >= 5) %>% # filter only for ingredients that appear in min. 10 recipes
        pairwise_cor(ingredient, name, sort = TRUE)

ingredient_pairs
## # A tibble: 1,482 x 3
##    item1          item2          correlation
##    <chr>          <chr>                <dbl>
##  1 whole egg      powdered sugar       0.361
##  2 powdered sugar whole egg            0.361
##  3 port           powdered sugar       0.278
##  4 powdered sugar port                 0.278
##  5 dry vermouth   gin                  0.276
##  6 gin            dry vermouth         0.276
##  7 egg yolk       powdered sugar       0.251
##  8 powdered sugar egg yolk             0.251
##  9 lime juice     rum                  0.251
## 10 rum            lime juice           0.251
## # … with 1,472 more rows

Now let’s see which ingredients are most closely correlated with some of the most popular ones:

ingredient_pairs %>% 
        filter(item1 %in% c("lemon juice", "gin", "rum", "whiskey", "vodka")) %>% 
        group_by(item1) %>% 
        top_n(8, correlation) %>% 
        ungroup() %>%   # wasn't included in D. Robinson's code, but is needed
        mutate(item2 = reorder_within(item2, by = correlation, within = item1)) %>%  # need the 'within' argument for faceting
        ggplot(aes(correlation, item2)) +
        geom_col() +
        facet_wrap(~ item1, scales = "free_y") +
        scale_y_reordered() +
        labs(title = "Which ingredients are most correlated with particular ingredients?",
             y = "Most correlated ingredients")

These plots also make sense on inspection: gin is often found with vermouth, which is how you make martinis and negronis; lemon juice is often found with sugar/syrup to balance out the acidity; while rum and lime juice is a classic combination as are vodka/cranberry and whiskey/bitters.

I didn’t realise that gin is often combined with apricot-flavoured brandy but further investigation reveals that there are 18 such drinks!:

cocktails_parsed_filtered %>% 
  group_by(name) %>%  
  filter(all(c("gin", "apricot flavored brandy") %in% ingredient)) %>% 
  arrange(name) %>% 
  distinct(name)
## # A tibble: 18 x 1
## # Groups:   name [18]
##    name                    
##    <chr>                   
##  1 Apricot Anise Collins   
##  2 Apricot Anisette Collins
##  3 Apricot Cocktail        
##  4 Bermuda Rose            
##  5 Fairy Belle Cocktail    
##  6 Favorite Cocktail       
##  7 Favourite Cocktail      
##  8 Frankenjack Cocktail    
##  9 Golden Dawn             
## 10 K. G. B. Cocktail       
## 11 K.G.B. Cocktail         
## 12 Resolute Cocktail       
## 13 Webster Cocktail        
## 14 Wembley Cocktail        
## 15 Wembly Cocktail         
## 16 Western Rose            
## 17 What The Hell           
## 18 Why Not?

Network analysis

R has a couple of nice packages for displaying network graphs, ggraph and igraph. Let’s use them to plot the ‘connectedness’ of our data. Although not formally an unsupervised learning technique this should give us an idea as to what kind of clusters exist in the data.

library(ggraph)
library(igraph)
library(ggrepel)

ingredients_summarized <- cocktails_parsed_filtered %>%         
        group_by(name) %>% 
        mutate(percentile = row_number() / n()) %>%  # Tells us relative position of ingredient in drink - accounts for fact that some recipes have many ingredients and some don't
        ungroup() %>% 
        group_by(ingredient) %>% 
        summarize(n = n(),
                  avg_position = mean(percentile), # get average position in drink of ingredient
                  avg_serving = mean(measure_number, na.rm = TRUE)) %>%  # get average serving measure of ingredient
        arrange(desc(n)) 

# Get top 70 correlated ingredients
top_cors <- ingredient_pairs %>% head(70)  

# Filter ingredients_summarised for only highly correlated ingredients
ingredient_info <- ingredients_summarized %>% 
        filter(ingredient %in% top_cors$item1) %>% 
        distinct()

# Generate network graph
top_cors %>% 
        graph_from_data_frame(vertices = ingredient_info) %>% 
        ggraph(layout = "fr") +
        geom_edge_link() +
        geom_node_text(aes(label = name), repel = TRUE) +
        geom_node_point(aes(size = 1.1* n)) +        
        geom_node_point(aes(size = n, color = avg_position)) +
        scale_color_gradient2(low = "blue", high = "red", midpoint = .8,
                              labels = scales::percent_format()) +
        labs(size = "# of recipes",
             color = "Avg position in drink",
             title = "Network of cocktail ingredients",
             subtitle = "Connected ingredients tend to appear in the same drink. A bluer colour indicates that it tends to appear earlier in drinks.")

This is a nice plot! As before, we see popular combinations of ingredients (e.g. whiskey/bitters, gin/vermouth, lemon juice/sugar) but now we can see the full picture. Unsurprisingly the spirits tend to be quite far apart (except for gin and brandy) and surrounded by flavouring agents such as juices and soft liqueurs. The bluer colour of these spirits show that they tend to be earlier in the drink recipes.

Principal component analysis

Finally, we can try a more formal unsupervised learning technique: principal components analysis (PCA). For the sake of brevity I won’t explain PCA here for the uninitiated, but I will summarise its purpose:

“When faced with a large set of correlated variables, principal components allow us to summarize this set >with a smaller number of representative variables that collectively explain most of the variability in the >original set…PCA also serves as a tool for data visualization.”

— An Introduction to Statistical Learning, James/Witten/Hastie/Tibshirani

In other words

cocktails_df <- cocktails_parsed_filtered %>%
  select(-ingredient_number, -row_id, -measure) %>%
  pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = 0) %>%
  janitor::clean_names() %>% 
  na.omit()

cocktails_df
## # A tibble: 952 x 41
##    name  category   rum lemon_juice lime_juice sweet_vermouth orange_juice
##    <chr> <chr>    <dbl>       <dbl>      <dbl>          <dbl>        <dbl>
##  1 Gaug… Cocktai…  2           1          1               0           0   
##  2 Fort… Cocktai…  1.5         0          0.25            0.5         0.25
##  3 Cuba… Cocktai…  2           0          0.5             0           0   
##  4 Cool… Cocktai…  1.5         0          0               0           1   
##  5 John… Whiskies  0           1          0               0           0   
##  6 Cher… Cocktai…  1.25        0          0               0           0   
##  7 Casa… Cocktai…  2           0          1.5             0           0   
##  8 Cari… Cocktai…  0.5         0          0               0           0   
##  9 Ambe… Cordial…  0           0.25       0               0           0   
## 10 The … Whiskies  0           0.5        0               0           0   
## # … with 942 more rows, and 34 more variables: powdered_sugar <dbl>,
## #   cranberry_juice <dbl>, pineapple_juice <dbl>, whiskey <dbl>,
## #   simple_syrup <dbl>, cherry_flavored_brandy <dbl>, light_cream <dbl>,
## #   triple_sec <dbl>, maraschino <dbl>, amaretto <dbl>, grenadine <dbl>,
## #   apple_brandy <dbl>, brandy <dbl>, gin <dbl>, anisette <dbl>,
## #   dry_vermouth <dbl>, apricot_flavored_brandy <dbl>, sloe_gin <dbl>,
## #   bitters <dbl>, benedictine <dbl>, egg_white <dbl>, half_and_half <dbl>,
## #   vodka <dbl>, grapefruit_juice <dbl>, port <dbl>,
## #   white_creme_de_cacao <dbl>, yellow_chartreuse <dbl>,
## #   citrus_flavored_vodka <dbl>, whole_egg <dbl>, egg_yolk <dbl>,
## #   dubonnet <dbl>, blanco_tequila <dbl>, old_mr_boston_dry_gin <dbl>,
## #   old_mr_boston_five_star_brandy <dbl>
library(tidymodels)

pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%   # centers and scales all predictors
  step_pca(all_predictors())

pca_prep <- prep(pca_rec) 

tidied_pca <- tidy(pca_prep, 2)

tidied_pca %>%
  filter(component %in% paste0("PC", 1:3)) %>%
  mutate(component = fct_inorder(component)) %>%
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL)

library(tidytext)

tidied_pca %>%
  filter(component %in% paste0("PC", 1:4)) %>%
  group_by(component) %>%
  top_n(8, abs(value)) %>%
  ungroup() %>%
  mutate(terms = reorder_within(terms, by = abs(value), within = component)) %>%  # from tidytext package - reorders column before plotting with faceting, such that the values are ordered within each facet...needs scale_x_reordered or scale_y_reordered to be used later
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y") +
  scale_y_reordered() +  # see earlier comment
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )